ABSTRACT



An examination and comparison of pilot rating scales presently 
in use and an investigation into the possibilities of a linear rating 
scale were conducted. The hypothesis was advanced that a rater 
may transpose his impression of performance directly to a non- 
adjectival, non-ordinal rating scale and thereby relate his psycholog- 
ical continuum to a numerical index. Experimental data, though 
limited, tended to support this hypothesis. 
I. INTRODUCTION



With the advent of flight vehicles with operating envelopes ranging 
from terra firma to the threshold of space and beyond, the environ- 
mental and dynamic spectrums encountered on a single flight are 
all-encompassing. Man is the low frequency response component
[Ref. 1] in the overall closed-loop man-machine system; therefore,
control systems must be designed within manageable limits. In short,
the effort expended in vehicle control must be minimized so that the 
pilot may be free to complete other duties in the cockpit. 

Consequently, the suitability of a machine system to serve its 
intended mission is ultimately determined by a series of evaluations. 
The most difficult of these assessments occurs at the man-machine
system interface.

Pilot evaluation of handling qualities determines the suitability 
of the machine system, yet there remains to be found a set of 
universally acceptable parameters for this evaluation. The complete 
nature of a pilot's task, work load, mental stress and acuity have 
not been described in any form of analytically determined transfer 
function or performance index [Ref. 2]. It is assumed, however, 
that there exists a relationship between pilot comment and performance
and/or vehicle handling qualities.

Efforts to standardize the qualitative aspects of language into 
a quantitative handling quality rating have been made. It is the 
purpose of this study to examine and compare the rating scales 
presently in use and to investigate the possibilities of a linear rating
scale with its inherent advantages. 
II. HISTORY



A. EARLY DEVELOPMENTS 

During the early 1930’s when aviation was maturing, the need to 
delineate acceptable aircraft parameters was recognized. Consequently, 
a “check list” for this purpose was proposed by Edward P. Warner 
[Ref. 3]. Subsequent work by Soule 1 and by R. R. Gilruth at the 
Langley Laboratory of NACA condensed these requirements and a 
set of specifications for military aircraft acceptance eventually 
resulted [Ref. 4]. 

After this initial break-through in establishing aircraft specifica- 
tions, emphasis was placed on devising pilot opinion ratings aimed at 
specific problem areas. The concept of a general pilot rating received 
little attention. 

B. COOPER SCALE 

In 1957 at the annual meeting of the Flight Testing Session,
Institute of Aeronautical Sciences, Ames’ Chief Research Pilot George
E. Cooper introduced a generalized pilot rating scale which enjoyed 
immediate and almost total acceptance [Ref. 5]. This epoch scale 
(Fig. 1) synthesized the previous work of NACA Langley and thereby 
provided an authenticated scale which could be applied to any aircraft 
handling qualities evaluation. It was the first rating scale to associate 
the qualitative nature of pilot opinion with a quantitative index. 

In applying this scale, it was recommended that the evaluator 
pay particular attention to question formulation (Fig. 2). The 
question had to be sufficiently specific so as to minimize interpretation
and ambiguity.

The pilot, in answering the question, was required to channel his 
exposure, sensations and reactions into the scale vocabulary by first 
considering four handling qualities categories: Satisfactory,
Unsatisfactory, Unacceptable and Unprintable. As may be noted from
Figure 1, these categories were separated, for description purposes, at
the approximate values 3.5, 6.5 and 9.5 respectively. Within each
category, the pilot was required to further define his opinion in terms
of the scale vocabulary and a secondary mission (landing).

Once the pilot had formulated his opinion with respect to the scale, 
his evaluation had to be weighted in consideration of his viewpoint, 
experience and adaptability. For example, a patrol pilot might
evaluate the stall-associated buffet and departure in [illegible] as
"Unacceptable-Dangerous" (numerical rating 8); whereas, a fighter pilot might
evaluate the same characteristics as "Satisfactory, but with some 
unpleasant characteristics" (numerical rating 3). Then, with some 
exposure, the same two pilots might reevaluate the characteristics 
at 4 and 2 respectively. The rating scale was, therefore, very subject 
to experience and adaptability. To eliminate this deficiency and to 
provide some measure of consistency, it was suggested that the scale 
be used only by test pilots. 

Though the Cooper Scale had claim to primacy, it was ambiguous 
in its definitions and complicated in that it placed stipulations on pilot 
opinion. It would appear that the scale was designed to evaluate 
aggregate handling qualities. 
C. HARPER SCALE



Robert P. Harper, Jr. used a pilot opinion scale (Fig. 3) for 
evaluating the handling qualities of a variable stability aircraft in 
1959 [Ref. 6]. The Harper Scale was developed honoring the stipulations
of question formulation but with a concept quite different from
the Cooper Scale. Harper was interested in evaluating pilot-vehicle 
performance, but found this extremely difficult because of pilot 
adaptability. Instead, a scale was devised to evaluate pilot opinion 
with respect to alterations in the stability derivatives and thereby 
arrive at a pilot preference: a most suitable aircraft stability. 

To ensure reliability and compensate for scale vocabulary deficiencies,
test pilots wire-recorded their subjective comments
during the evaluation and recorded their scale rating following each 
evaluation. This was, perhaps, the best aspect of the testing 
procedure. The pilot rating was kept simple and subordinate to the 
subjective evaluation. Because of this reliance on subjective com- 
ments made during the tests, the pilot rating was utilized as a 
cursory index to the evaluation and not as an end in itself. 

In evaluating the handling qualities with respect to the rating 
scale, the pilot considered four handling qualities categories: 
Acceptable and Satisfactory, Acceptable but Unsatisfactory,
Unacceptable, and Unflyable. The separation between these categories
occurred at 3.5, 6.5 and 9.5 respectively. Within each category,
the pilot further defined his opinion in terms of a single, though 
sometimes ambiguous, adjective (Fig. 3). 
D. HARPER SCALE ADAPTATIONS



In contrast to the Cooper Scale, the Harper Scale (often cited 
as the Cornell or CAL Scale because of its extensive use by Cornell 
Aeronautical Laboratory, Inc.) was designed as an index for evaluating
particular and highly restricted handling qualities. Efforts to
adapt the CAL Scale to the evaluation of aggregate handling qualities
met with varied success. 

One such example was the application made by Michael L. Parrag 
in 1967 [Ref. 7] in studying the effects on handling qualities of higher-
order response characteristics against a background of varying
conditions and associated mission tasks. 

To facilitate more reliable and consistent pilot comments, the 
test pilots were provided with a comment check list for the two flight 
conditions (Fig. 4), and instructed to make subjective comments 
following each test run. After all tasks were completed, a compre- 
hensive subjective report was required incorporating all the salient 
features of each configuration. Finally, an objective report using 
the comment check list was made. 

Here, as in Ref. 4, emphasis was placed on subjective comments. 
Task-oriented objective comments were used to provide consistency 
and point out features of each task which might otherwise have been 
overlooked. Although the CAL Scale was used as an index to pilot 
opinion, it was, for all practical purposes, insignificant in evaluating 
the handling qualities investigated. 

1. IS THE AIRPLANE DIFFICULT TO TRIM?

2. IS ATTITUDE CONTROL SATISFACTORY?
TENDENCY TO OVERCONTROL?

3. IS MAINTAINING ALTITUDE A PROBLEM?
a) STRAIGHT AND LEVEL
b) TURNS

4. IS MAINTAINING AIRSPEED A PROBLEM?

5. WERE GLIDE SLOPE ERRORS EASILY CORRECTED?
WAS IT DIFFICULT TO MAINTAIN GOOD GLIDE SLOPE CONTROL?

6. WHAT INSTRUMENTS ARE YOU USING MOST?

7. COULD YOU MAKE AN INSTRUMENT LANDING APPROACH
WITH THIS CONFIGURATION AT THE SPEED OF 125 KNOTS?

8. PILOT RATING - ADJECTIVES - NUMBER - WHY?


HIGH ALTITUDE COMMENT CARD

1. IS THE AIRPLANE DIFFICULT TO TRIM?

2. IS ATTITUDE CONTROL SATISFACTORY? TENDENCY TO OVERCONTROL?

3. IS NORMAL ACCELERATION CONTROL A PROBLEM?

4. IS HOLDING ALTITUDE A PROBLEM?
a) STRAIGHT AND LEVEL
b) TURNS

5. ARE THERE ANY DIFFICULTIES IN FLIGHT PATH CONTROL
DURING THE CLIMBING AND DESCENDING TURNS?

6. ARE THERE ANY PROBLEMS ASSOCIATED WITH THE
TRACKING TASK?

7. PILOT RATING - ADJECTIVES - NUMBER - WHY?


PILOT COMMENT CARDS
FIGURE 4


E. COOPER-HARPER SCALE

With wide and independent usage of the Cooper and Harper Scales,
the problems previously cited for each were sources of confusion in
application. It became increasingly apparent that an acceptable
composite rating system incorporating the best features of each
scale would be advantageous.

To this end Cooper and Harper jointly advanced a revised rating 
scale in 1966 [Ref. 8]. This scale (Fig. 5), hereafter referred to 
as the Cooper-Harper Scale, enjoyed general acceptance and prefer- 
ence over the previous scales; however, the various implementing 
institutions voiced a need for clarification in semantics and in application.
In 1969 an explicitly comprehensive joint report was published
to modify and clarify the Cooper-Harper Scale [Ref. 9]. The report
precisely defined flight evaluation terminology and discussed the 
aspects of question formulation and scale data application. 

Based on the voluminous data and comments available from 
international audiences of the Cooper and Harper Scales, the Cooper- 
Harper Scale was excellently designed as a dichotomous procedure 
of evaluation. A pilot, in evaluating a handling quality, systematically 
chose between two alternatives which channeled his consideration into 
a rating category or into another dichotomous decision with the same 
channeling result. Through this simplified procedure (compare with 
the relative complexity of previously discussed procedures) three of 
four existing categories were eliminated without ever considering 
the applicable descriptive adjectives. 
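
The dichotomous procedure described above can be sketched as a short chain of yes/no decisions. The sketch is illustrative only: the three decision flags paraphrase the questions commonly associated with the published Cooper-Harper form rather than quoting Fig. 5, the function name is hypothetical, and the rating bands simply follow the 3.5, 6.5 and 9.5 boundaries cited in this section.

```python
def cooper_harper_band(controllable, adequate_performance, satisfactory):
    """Return the (low, high) rating band reached by three yes/no decisions.

    Assumed decision order (a paraphrase, not the wording of Fig. 5):
      1. Is the vehicle controllable?
      2. Is adequate performance attainable with tolerable workload?
      3. Is the vehicle satisfactory without improvement?
    """
    if not controllable:
        return (10, 10)   # beyond the 9.5 boundary
    if not adequate_performance:
        return (7, 9)     # between the 6.5 and 9.5 boundaries
    if not satisfactory:
        return (4, 6)     # between the 3.5 and 6.5 boundaries
    return (1, 3)         # below the 3.5 boundary


# Each decision discards one band before any adjective is consulted.
print(cooper_harper_band(True, True, False))
```

Only after a band has been selected would the pilot consult the descriptive adjectives to choose a single number within it.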

The inverted ten-point scale was retained in the interests of
consistency. An ordinal sequence varying in magnitude with the
degree of "goodness" would seem more appropriate; however,
audiences of the previous scales had become accustomed to the
inverted scale and a reordering of the numerical indices would have
resulted in unnecessary confusion. To further ease the transition
from previous scales, the boundaries of 3.5, 6.5 and 9.5 were
retained.

COOPER-HARPER SCALE
FIGURE 5

It would appear that a satisfactory method for assessing the 
man-machine interface had been achieved; but not quite. Although 
the Cooper-Harper Scale continues to be the most widely used 
evaluation system to date, it remains insensitive at the bad end and 
does not exhibit the desirable feature of linearity. Linearity is that 
feature of a rating scale which will allow the averaging of data 
ensembles without distorting the data sample interpretation. 
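
The effect of non-linearity on averaging can be illustrated with a small sketch. It assumes, purely for illustration, that the seven Handling Qualities phrases of Appendix B are used as a seven-point ordinal scale (1 = "Excellent", 7 = "Very bad") and that their psychological means represent the raters' true impressions; the two sets of ratings are hypothetical.

```python
from statistics import mean

# Psychological means of the seven Handling Qualities phrases (Appendix B),
# listed in ordinal order 1..7. Using them as an ordinal scale is an
# assumption made only for this illustration.
PSYCHOLOGICAL_MEAN = [1.00, 1.47, 2.58, 2.65, 4.13, 7.74, 8.22]


def ordinal_average(ranks):
    """Average the ordinal indices directly."""
    return mean(ranks)


def psychological_average(ranks):
    """Average the psychological values behind the ordinal indices."""
    return mean(PSYCHOLOGICAL_MEAN[r - 1] for r in ranks)


ratings_a = [5, 7]  # hypothetical: "Fair" and "Very bad"
ratings_b = [6, 6]  # hypothetical: "Bad" twice

# Both ensembles share one ordinal average, yet their psychological
# averages differ, so the ordinal average hides a real difference.
print(ordinal_average(ratings_a), ordinal_average(ratings_b))
print(psychological_average(ratings_a), psychological_average(ratings_b))
```

On a linear scale the two kinds of average would agree up to a constant scale factor, which is the property the text refers to.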

F. McDONNELL SCALE

In 1968, J. D. McDonnell published his study of rating techniques 
[Ref. 10]. His objective was to evolve a rating scale which had an 
underlying linear structure to facilitate mathematical operations on 
pilot data. This underlying structure required the discipline of 
psychophysics for determination. 

Although a detailed examination of psychophysics is beyond the 
scope of this study, the basic theory is presented for clarification. 

If an objective measure is made upon some object, the resulting data 
must lie along some physical continuum. If an evaluator estimates 
a measure, the measure is subjective and must lie along some 
psychological continuum. The relationship between these two con- 
tinua, if it could be determined, would provide a means of linear- 
izing the subjective scale. 
To establish an intervaled psychological continuum, a list of
sixty-four appropriately descriptive phrases was randomly submitted
to sixty-three raters. For each phrase, the raters were
instructed to indicate their impression of a hypothetical vehicle 
so described on a plot with the end points of "most favorable" and 
"least favorable". The data were then processed by the methods 
of psychophysics and successive intervals and assigned a relative 
standing on a scale of nine. The data were further reduced to the 
arbitrary seven-point McDonnell Scale depicted in Figure 6. 

The McDonnell Scale (often called the Global Rating Scale be- 
cause it related aggregate handling qualities) was, therefore, 
presumed to be a linear scale of constant subjective sensitivity 
reflecting the resolving power of raters to distinguish semantic 
differences. Because it was related to a seven-point scale in 
contrast to the ten point scales with which users were familiar, it 
was not accepted with any noticeable exuberance. 

The truly important contribution made by McDonnell was the
list of evaluation phrases related to an index of nine and reflecting
psychological sensitivity. The phrases were divided into six
categories: Handling Qualities, Control, Precision, Response
Characteristics, Effects of Deficiencies, and Demands on Pilot.
Through the use of this listing, specialized linear scales may be 
constructed to satisfy particular rating requirements. 
See APPENDIX B.
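
One way such a specialized scale might be assembled from the phrase list is sketched below. The selection rule (choose the unused phrase nearest each of n evenly spaced target values) is an assumption made for illustration and is not taken from McDonnell's report; the function name is hypothetical. The phrases and psychological means are the Control entries of Appendix B.

```python
# Control phrases and their psychological means, from Appendix B.
CONTROL_PHRASES = {
    "Extremely easy to control with excellent precision": 0.97,
    "Very easy to control with good precision": 1.76,
    "Easy to control with fair precision": 3.21,
    "Controllable with somewhat inadequate precision": 5.43,
    "Controllable, but only very imprecisely": 6.65,
    "Difficult to control": 7.18,
    "Very difficult to control": 8.15,
    "Nearly uncontrollable": 8.91,
}


def pick_evenly_spaced(phrases, n):
    """Select n phrases whose means lie nearest n evenly spaced targets."""
    if not 2 <= n <= len(phrases):
        raise ValueError("n must lie between 2 and the number of phrases")
    low, high = min(phrases.values()), max(phrases.values())
    remaining = dict(phrases)
    chosen = []
    for i in range(n):
        target = low + i * (high - low) / (n - 1)
        # Nearest phrase not yet used, so no phrase appears twice.
        best = min(remaining, key=lambda p: abs(remaining[p] - target))
        chosen.append((best, remaining.pop(best)))
    return chosen


for phrase, value in pick_evenly_spaced(CONTROL_PHRASES, 4):
    print(f"{value:5.2f}  {phrase}")
```

Because the available phrases are not themselves evenly spaced, the selected phrases only approximate equal intervals; a finer phrase list would permit a closer approximation.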
G. CONTEMPORARY RESEARCH



In designing the washout circuitry² for the Ames All-Axis Motion
Generator, it became necessary and expedient to solicit pilot opinion
in determining the "best" set of parameters to use in a given
configuration. To this end, S. F. Schmidt and Bjorn Conrad [Ref. 11]
used three non-ordinal, relative rating scales in their evaluations
(Fig. 7-a, b, c).

The questions related to each scale were particularly tailored to 
the descriptive adjectives shown and they were simple in nature. By 
using pilot comments as an index, the design providing the best over- 
all simulator characteristics was obtained. However, moderate 
changes in the washout circuitry initially selected did not alter pilot 
opinion during subsequent testing. 

It would appear that one or both of the following factors were 
responsible for the inability of rating pilots to distinguish minor 
changes in simulator characteristics: 

1. The evaluation task was insensitive to minor changes 
in system response 

2. The rating scale adjectives were too widely separated
on the psychological continuum.

During a personal interview³ Conrad discussed the work on which
he had reported in Ref. 11. In determining the best washout circuitry
the pilot ratings extracted from his scales were heavily supplemented
with debriefs. It was primarily through this method of pilot interview
that the best washout circuitry was obtained.

² Washout circuitry: that servo circuitry of an all-axis motion simulator
which provides for returning the simulator to its initial position after
being disturbed. It is important that this function be executed at a rate
below a pilot’s sensing threshold.

³ Interviewed on 10 May 1971 at Analytical Mechanics Associates,
Inc. of Palo Alto, California.

He observed that pilots rapidly adapted to minor configuration 
changes without altering their rating, and he described this lack of 
sensitivity as a rating plateau (Fig. 8). 




Performance 
PILOT RESPONSE 

Figure 8 



He additionally noticed that a pilot’s impression of his mean 
performance changed from day to day. This, therefore, required 
that at least one test run utilizing the "standard" washout circuitry
be conducted to reestablish the pilot's mean performance, a time- 
consuming and costly procedure. 

Conrad's present work, an extension of that above, tasks pilots 
with flying formation on the television display of a six-degree-of-freedom
simulated tanker aircraft. It is his hope that this relative
position task will prove to be sufficiently sensitive and thereby 
provide reliable pilot ratings on the scale depicted in Figure 7-d. 

H. SUMMARY 

The rating scales which have been reviewed fall into the two 
categories, as distinguished according to purpose, of aggregate and 
relative handling qualities evaluations. The first category consists
of the Cooper and Cooper-Harper Scales; whereas, the latter consists
of the Harper, McDonnell and Conrad Scales.

During a personal interview⁴ Cooper related the circumstances
stimulating the evolution of his Scale. While evaluating a variable
stability F6F Hellcat the project engineers had an understandable
tendency to mathematically manipulate the flight data in the course 
of its reduction; however, the conclusions derived therefrom did not 
necessarily reflect the pilot’s interpretation of the actual handling 
qualities encountered. To eliminate this inadvertent misinterpreta- 
tion of flight data, the Cooper Scale was designed. 

When Cooper presented his Scale at the annual meeting of the 
Institute of Aeronautical Sciences it was immediately accepted and 
internationally implemented as an aggregate evaluation scale. Though 
the Cooper Scale was not designed for this purpose, international 
usage determined its application. 

In the collaborative effort to develop the Cooper-Harper Scale, 
Harper advocated a relative evaluation scale; however, the various 
implementing institutions preferred a scale applicable to aggregate 
evaluations and the dichotomous scale resulted. 

The Harper and Conrad Scales were obviously designed to
evaluate relative handling qualities and no further discussion is
necessary.

The McDonnell (or Global) Scale was designed as an aggregate
rating scale; however, because of its syntactical simplicity it could
be applied only to relative evaluations (see Fig. 6). The sixty-three
psychologically intervaled phrases resulting from McDonnell's research,
however, were applicable to both aggregate and relative
handling qualities evaluations.

In evaluations utilizing any of the rating scales except the Cooper-
Harper Scale, subjective pilot comment was required to provide
meaningful evaluation data.

⁴ Interviewed on 10 May 1971 at the Ames Research Laboratory,
NASA, NAS Moffett Field, California.
III. HUMAN RESPONSE



A. INTRODUCTION 

The Cooper-Harper Scale was excellently designed and remains 
the best aggregate rating scale in existence because of its dichotomous 
nature and its acceptance as the international standard. However, it 
was specifically designed so as not to facilitate the averaging of
ratings [Ref. 9].

With the advent of greater sophistication in aircraft research 
and development, it has become increasingly important to evaluate 
the relative "goodness" of aircraft components and subsystems. It 
is assumed that a highly desirable aerospace vehicle may be designed 
and built; however, a rating scale capable of reliably determining the 
acceptance or rejection of one highly desirable system over another 
is yet to be evolved. It is the purpose of this section to investigate 
the possibility of such a rating scale. 

For a scale to effectively reflect minor differences in performance, 
extreme sensitivity is desired. The inherent advantages of linearity 
are also desired to facilitate mathematical operations on a limited 
ensemble and thereby suppress research and procurement costs. 

The hypothesis of this investigation is that a linear rating scale 
coincident with the psychological continuum begets sensitivity. The 
psychological continuum was investigated [Ref. 10] and resulted in the
McDonnell Scale, but, as may be noted from Figure 6, descriptive
adjectives and/or phrases did not align cardinally. This, then,
provided a source of confusion because the numerical value associated
with the adjective might not coincide with the rater's psychological
continuum. Were this source of syntactic confusion eliminated, the
rater could transpose his impression of performance directly to a
rating scale and thereby relate his psychological continuum to a
linear numerical index. Additionally, if the rater were allowed to
fractionalize his rating, sensitivity would be limited only by the
rater's discriminate dispersion⁵ and frustrations.

To investigate this hypothesis, a simple puzzle was selected and 
submitted to the analytically inclined students in the Department of 
Aeronautics of the Naval Postgraduate School. Upon completion of 
the test, or at the expiration of an allotted time, the subjects were 
asked to rate their impression of the difficulty they encountered in 
working the puzzle on three numerical scales. 

B. TEST EQUIPMENT 

The plastic Kohner EVEN-STEVEN solitaire puzzle (Fig. 9) was 
used as the testing device. It consisted of a base with eight equal 
depth holes, eight equal length sleeves with variable interior depths, 
and eight variable length pegs. The puzzle had 40,320 (eight factorial)
possible arrangements, one of which resulted in all pegs being even.
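
The count of 40,320 can be checked with a short sketch. It assumes that an arrangement is an assignment of the eight distinct pegs to eight fixed positions, which is one reading of the puzzle description; the function name is hypothetical.

```python
import math
from itertools import permutations


def count_arrangements(n):
    """Count the orderings of n distinct pegs by direct enumeration."""
    return sum(1 for _ in permutations(range(n)))


PEGS = 8
total = count_arrangements(PEGS)

# Enumeration agrees with the closed form n!.
print(total, math.factorial(PEGS))
# Chance that a single random arrangement is the one with all pegs even.
print(1 / total)
```

Under this assumption, a subject arranging the pegs at random would have roughly one chance in forty thousand of succeeding on a given attempt, so completion within the allotted time reflects deliberate strategy rather than luck.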

A standard stop-watch was used for timing, and the scales
depicted in TABLE I were used for rating purposes.

⁵ The deviation of the resolving power of raters to distinguish
minor differences in performance.


TABLE I

RATER QUESTIONNAIRE

NAME                    AGE                    DATE

You are requested to solve the EVEN-STEVEN puzzle as a Human Response
Section of a Thesis. You will have 60 seconds in which to complete
the exercise. After completing please indicate the degree of
Difficulty you encountered while performing the exercise.

DIRECTIONS:

even (make STEVEN-EVEN).

3. Do not look into the sleeves for distance estimations.


C. TESTING PROCEDURE

Before starting the exercise, the subjects were briefed in detail
regarding the physical characteristics of the puzzle. Prior to each
test the pegs and sleeves were removed from the base and mixed randomly
within a box before the subject. The exercise was started on
the proctor’s "mark" with the subject’s hands poised over the box.

At test completion the time was recorded or, if the subject did not 
complete the test in 60 seconds, the number of even pegs, regardless 
of height, was recorded. The elapsed time or number of even pegs 
was the basis for determining performance. 

The subject was then asked to rate his impression of the difficulty 
he encountered in working the puzzle with respect to all three scales 
on the RATER QUESTIONNAIRE (TABLE I), and to indicate his rating
in the box provided. This procedure was repeated twice to provide 
for three tests. When subjects inquired as to the degree of difficulty 
associated with scale end points, they were briefed that this deter- 
mination was the rater’s responsibility. By so doing, the rater’s 
personal psychological continuum was enjoined. 

D. RESULTS AND DISCUSSION 

The exercise was administered to thirty-one subjects as outlined 
above, and the raw data were recorded in Appendix A. Of the subjects
tested, 25, or 80.8%, understood the rating procedure. The remaining
six failed to rate their impression of the difficulty they encountered,
as evidenced by their constant ratings on each scale, regardless of
their performance, throughout the testing sequence. Consequently,
these data were discarded because it was impossible to determine
the linear correlation of a point.

1. Question Formulation and Interpretation

The failure of 19.2% of the subjects to comprehend the rating
procedure may be the result of incorrectly written rating statements
(i.e., ". . . please indicate the degree of Difficulty you encountered
while performing the exercise." and "Indicate your impression of
Difficulty in the box provided."). However, these statements were
combined during the pretesting brief (i.e., "Indicate your impression
of the Difficulty you encountered in working the puzzle.").

When this 19.2% was queried regarding their constant rating, all
replied that the difficulty of the test was a constant regardless of
their performance.

Subjects 7, 12 and 14 (TABLE II) all had inappropriately low 
correlation factors because their ratings indicated increased difficulty 
for increased performance. When each was queried, he related that 
more incorrect puzzle combinations were discovered in subsequent 
testing; consequently, his impression of puzzle difficulty increased. 
Although these ratings did not properly reflect the rating statements, 
they were used in the Linearity section because such deviations may
be expected in any testing procedure. 

2. Linearity

Linear correlation assumes a linear relationship between
variables. If a series of variables are linearly related, the correlation
factor will be 1.00. Deviations from linearity will yield
factors less than 1.00.

To facilitate detailed analysis and to justify raw data averaging, 
an individual correlation factor (r) was calculated for each exercise 
subject listed in TABLE II. In correlation factor calculations the 
time to exercise completion or the number of even pegs was used as 
the independent variable, and the subject's rating was used as the 
dependent variable. 
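
The calculation of an individual correlation factor can be sketched as follows. The sketch assumes that the correlation factor is the ordinary Pearson product-moment coefficient, which the text does not state explicitly; the function name and the sample times and ratings are hypothetical and are not taken from Appendix A.

```python
import math


def correlation_factor(performance, ratings):
    """Pearson correlation between performance measures and ratings.

    Returns None when either series is constant, because the
    coefficient is then undefined (zero variance).
    """
    n = len(performance)
    if n != len(ratings) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_p = sum(performance) / n
    mean_r = sum(ratings) / n
    s_pr = sum((p - mean_p) * (r - mean_r)
               for p, r in zip(performance, ratings))
    s_pp = sum((p - mean_p) ** 2 for p in performance)
    s_rr = sum((r - mean_r) ** 2 for r in ratings)
    if s_pp == 0 or s_rr == 0:
        return None
    return s_pr / math.sqrt(s_pp * s_rr)


# Hypothetical subject: completion times (seconds) over three tests and
# the difficulty ratings given after each test.
times = [50, 40, 30]
print(correlation_factor(times, [7.0, 6.0, 5.0]))   # ratings track time linearly
print(correlation_factor(times, [5.0, 5.0, 5.0]))   # constant ratings
```

The second call shows why the constant ratings of the six discarded subjects could not be used: with no variation in the rating, no correlation factor can be computed.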
TABLE II

RATER PERFORMANCE - RATING CORRELATION FACTORS (r)

                  SCALE                                 SCALE
SUBJECT      A       B       C        SUBJECT      A       B       C

   1       .992    .993    .993         14       .143    .142    .142
   2       .995    .983    .956         15       .986    .999    .986
   3       .992    .997    .992         16       .901    .998    .998
   4       .905    .905    .905         17       .999    .971    [illegible]
   5       .993    .993    .993         18       .999    .982    .929
   6       .899    .739    .897         19       .999    .499    .999
   7       .181    .181    .181         20       .866    .866    .866
   8       .938    .939    .939         21       .960    .961    .961
   9       .596    .659    .596         22       .997    .998    .998
  10       .866    .831    .866         23       .997    .953    .976
  11       .545    .645    .600         24       .999    .999    .999
  12       .189    .346    .453         25       .997    .993    .968
  13       .997    .999    .999

Scales A and B yielded correlation factors of which 90.9%
were greater than 0.8 and 81.8% were greater than 0.9. Scale C
yielded 77% and 72% respectively. The overall correlation factors
for Scales A, B and C were 0.928, 0.905 and 0.927 respectively.

This high degree of performance-rating correlation confirmed
linearity and sensitivity, and was an extremely strong indication 
that raters were able to relate their personal psychological continuum 
to a linear, non-adjectival, non-ordinal rating scale. It additionally 
provided justification for the averaging of ratings. 

Another feature of high correlation is that relatively few 
trials may be conducted with a high degree of confidence in the 
resulting data. This thereby reduces the time and cost expenditures 
associated with testing. 

3. Rating Analysis

The test subjects' ratings fell into two groups as characterized
by those who completed all tests during the allotted time (Group
X) and those who completed two or fewer tests (Group Y). As indicated
in Figure 10, Group X experienced less difficulty than Y throughout 
the testing sequence; however, the rating curves of Group X reflected 
decreased learning in contrast to the curves of Group Y. 

It should be noted that the rating curves of Group Y did not 
remain parallel as did those of Group X. This was, perhaps, an 
indication of the frustration experienced in not being able to complete 
each test. Such a factor would influence rating accuracy and, con- 
sequently, rating sensitivity. 

By averaging the unweighted corresponding test ratings of 
both Groups (there were more subjects in Group Y), Figure 11 was 
constructed. As may be observed, the average rating curves ranged
about the numerical mean of each scale, and, in fact, the average
ratings of Scales A, B and C were 5.00, 2.02 and 5.02 respectively.
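
The distinction between an unweighted and a weighted average of the two groups can be sketched as follows. All numbers are hypothetical: the group curves and the group sizes are invented for illustration (the text states only that Group Y was the larger), and the function names are not from the study.

```python
def unweighted_average(curve_x, curve_y):
    """Average the two group curves test by test, ignoring group size."""
    return [(x + y) / 2 for x, y in zip(curve_x, curve_y)]


def weighted_average(curve_x, n_x, curve_y, n_y):
    """Average the two group curves test by test, weighted by group size."""
    return [(n_x * x + n_y * y) / (n_x + n_y)
            for x, y in zip(curve_x, curve_y)]


# Hypothetical mean ratings on one scale for tests 1, 2 and 3.
group_x = [4.0, 3.0, 2.0]   # completed every test
group_y = [7.0, 6.0, 6.0]   # completed two or fewer tests

print(unweighted_average(group_x, group_y))
print(weighted_average(group_x, 10, group_y, 15))
```

The unweighted form gives each group equal influence on the combined curve, whereas the weighted form is pulled toward the larger group.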

Considering these two facts, it must be assumed that the
test subjects discarded any "degree of difficulty" associated with
the scale end points and related all of their ratings to the scale
numerical mean. Consequently, all rating was a matter of judgment;
a matter of relating their psychological continuum to the scales
presented in TABLE I. Whether test subjects consciously or subconsciously
related to the scales' numerical means was beyond the
scope of this study.

4. Scale Preference 

Of the 31 test subjects, 28 preferred Scale A, two preferred 
Scale B, and one preferred Scale C. It was interesting to note that 
Scale A construction paralleled that of the Cooper-Harper Scale 
(i.e., increasing numerical index with increasing degree of "badness");
however, only 35% of the test subjects had ever been exposed to the
Cooper-Harper Scale. Because the subjects were enrolled in a
mathematically oriented curriculum, the preference for a decimal
system based on ten seemed appropriate. As evidenced from the
overall correlation factors and the Scale average ratings, the preference
for Scale A appeared valid.

The limited preference for Scale B was believed to reflect 
exposure to the 4.0 Navy system. The preference for Scale C was
believed to have been made in the interests of inconsistency and 
levity. 
5. Recommendations



Because of the high correlation experienced during this 
investigation and the preference of raters for a decimal scale based 
on ten, it is recommended that such a scale be used in all relative 
rating evaluations. 

Although the Cooper-Harper Scale is unequivocally accepted 
for its designed purpose, it could be improved if the scale advanced 
herein were used to evaluate the relative "goodness" within a Cooper-
Harper ordinal category. For example, once an ordinal category
were determined via the dichotomous procedure, the category could
be further defined by Scale A utilization. To designate such a refinement,
the first number of a series could reflect the non-averageable
Cooper-Harper rating and subsequent numbers reflect Scale A
(i.e., 1.2.25).
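
The composite notation suggested above can be sketched as a pair of helper functions. The function names, the two-decimal formatting of the Scale A value, and the rule that averaging is permitted only within a single Cooper-Harper rating are assumptions made for illustration.

```python
from statistics import mean


def compose_rating(cooper_harper, scale_a):
    """Join a Cooper-Harper rating and a Scale A rating, e.g. '1.2.25'."""
    if not 1 <= cooper_harper <= 10:
        raise ValueError("Cooper-Harper rating must lie between 1 and 10")
    return f"{cooper_harper}.{scale_a:.2f}"


def parse_rating(text):
    """Split a composite rating at its first period."""
    category, refinement = text.split(".", 1)
    return int(category), float(refinement)


def average_within_category(ratings):
    """Average the Scale A parts; the Cooper-Harper part must be shared."""
    parsed = [parse_rating(r) for r in ratings]
    if len({category for category, _ in parsed}) != 1:
        raise ValueError("Cooper-Harper ratings differ; not averageable")
    return mean(value for _, value in parsed)


print(compose_rating(1, 2.25))
print(parse_rating("1.2.25"))
print(average_within_category(["1.2.25", "1.2.75"]))
```

Keeping the two parts separate preserves the ordinal character of the Cooper-Harper rating while allowing arithmetic on the linear refinement.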

E. CONCLUSIONS 

The purpose of this study was to examine and compare the rating 
scales presently in use and to investigate the possibilities of a linear 
rating scale. 

A review of rating scale development and a study of the current 
rating scales were presented in Section II. Section II also provides 
an organized source of information for the rating-scale novice that 
may be used to develop specialized rating scales. 

Section III advanced with some substantiation the hypothesis that 
a rater may transpose his impression of performance directly to a 
non-adjectival, non-ordinal rating scale and thereby relate his 
psychological continuum to a linear numerical index. Twenty-five 
test subjects utilized such a scale and 81.8% had correlation factors
in excess of 0.953 during three tests.

The use of a non-adjectival, non-ordinal scale could provide 
simplicity, linearity, averaging, high correlation and a high confidence 
for minimum testing. Such a scale, if used in contemporary testing, 
might greatly reduce evaluation and procurement costs. 



*Score based on a total of eight
**Scale preference 



APPENDIX B

LIST OF EVALUATION PHRASES

PHRASE                                                PSYCHOLOGICAL MEAN

Handling Qualities

Excellent handling qualities                                 1.00
Highly desirable handling qualities                          1.47
Good handling qualities                                      2.58
Pleasant handling qualities                                  2.65
Fair handling qualities                                      4.13
Bad handling qualities                                       7.74
Very bad handling qualities                                  8.22

Control

Extremely easy to control with excellent precision           0.97
Very easy to control with good precision                     1.76
Easy to control with fair precision                          3.21
Controllable with somewhat inadequate precision              5.43
Controllable, but only very imprecisely                      6.65
Difficult to control                                         7.18
Very difficult to control                                    8.15
Nearly uncontrollable                                        8.91

Precision

Extremely easy to control with excellent precision           0.97
Very easy to control with good precision                     1.76
Easy to control with fair precision                          3.21
Controllable with somewhat inadequate precision              5.45
Controllable, but only very imprecisely                      6.65

Response Characteristics

Excellent, pure (i.e., no accidental excitation)
  primary and secondary response characteristics             0.99
Good, relatively pure, primary and secondary
  response characteristics                                   2.47
Fair, somewhat impure primary or secondary
  response characteristics                                   4.62
Quite sensitive, sluggish or uncontrollable
  in primary or secondary responses                          6.00
Extremely sensitive, sluggish or uncontrollable
  in primary or secondary responses                          7.10

Effects of Deficiencies

Effects of deficiencies on performance is easily
  compensated for by pilot                                   4.04
Some minor but annoying deficiencies                         4.50
Moderately objectionable deficiencies                        5.57
Major, very objectionable deficiencies                       7.65

Demands on Pilot

Completely undemanding of pilots, very relaxed
  and comfortable                                            1.65
Largely undemanding of pilots, relaxed                       2.36
Mildly demanding of pilot attention, skill or effort         4.22
Demanding of pilot attention, skill or effort                5.88
Very demanding of pilot attention, skill or effort           7.50
Completely demanding of pilot attention, skill or effort     8.36